Income Prediction using a Census Regression Model

This directory contains a series of notebooks that demonstrate building a model to predict income using US census data.

The model is built using TensorFlow and the Google Cloud Datalab Machine Learning Toolbox, which provides out-of-the-box models that can be applied to your data. This sample uses a regression model (a model that predicts a continuous value), implemented as a deep neural network.
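To make "predicts a continuous value" concrete, here is a minimal sketch of regression in plain Python. It swaps in an ordinary least-squares line for the deep neural network the notebooks use, so it only illustrates the kind of output a regression model produces, not the sample's actual model. The function name `fit_line` and the toy numbers are assumptions made up for illustration; they are not census values.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy data, NOT taken from the census dataset.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

# A regression model returns a number on a continuous scale,
# rather than a class label as a classifier would.
prediction = slope * 5.0 + intercept
print(prediction)  # 11.0 for this toy data
```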

The notebooks use Google Cloud Machine Learning Engine to submit training jobs and to deploy the resulting model for predictions. They also use Google BigQuery and Google Cloud Dataflow at appropriate points in the sample.

Census Data

The sample uses the census/population datasets from the American Community Survey.

A copy of this data, extracted from the original zip file, is available in Google Cloud Storage as a publicly accessible object at gs://cloud-datalab-samples/census/ss14psd.csv.
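Publicly readable Cloud Storage objects can generally also be fetched over HTTPS at https://storage.googleapis.com/BUCKET/OBJECT. The sketch below converts the gs:// path above into that form. The helper names `gcs_public_url` and `read_first_line` are hypothetical, and whether the object is still publicly available is an assumption, so the network call is defined but not executed here.

```python
from urllib.parse import quote
from urllib.request import urlopen


def gcs_public_url(gs_uri):
    """Map gs://BUCKET/OBJECT to its public HTTPS URL (hypothetical helper)."""
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError("expected a gs:// URI, got: " + gs_uri)
    bucket, _, object_name = gs_uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError("URI must name both a bucket and an object: " + gs_uri)
    # Percent-encode the object name but keep "/" separators readable.
    return "https://storage.googleapis.com/{}/{}".format(bucket, quote(object_name))


def read_first_line(url):
    """Fetch only the first line (the CSV header) of a remote file.

    Assumes the object is publicly readable; not called in this sketch.
    """
    with urlopen(url) as response:
        return response.readline().decode("utf-8").rstrip("\r\n")


url = gcs_public_url("gs://cloud-datalab-samples/census/ss14psd.csv")
print(url)  # https://storage.googleapis.com/cloud-datalab-samples/census/ss14psd.csv
```

Calling `read_first_line(url)` would then return the header row without downloading the whole file, provided the object remains accessible.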

Workflow and Notebook Guide

A typical machine learning workflow spans multiple stages and is iterative in nature. It starts with data preparation and exploration, continues with training and model evaluation, and ends with deployment, after which the model is used in applications or other data pipelines to produce predictions.

When practicing this workflow, the recommended approach is to work with sample data and develop the model in the local Datalab environment before launching large-scale jobs on Cloud. The Cloud services are optimized for large scale, which is essential for processing the entire dataset and harnessing its value, but they add overhead and latency that get in the way of quick develop-test-validate iteration on sample data.
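One way to obtain sample data for local iteration is to draw a fixed-size random sample from a large CSV file. The following is a minimal sketch using reservoir sampling, which reads the file once and keeps only the sample in memory. It is not necessarily how the notebooks produce their sample; the function name `sample_csv_rows` and the toy `id,value` columns are assumptions for illustration.

```python
import csv
import io
import random


def sample_csv_rows(lines, k, seed=0):
    """Return (header, rows): the header plus up to k uniformly sampled data rows.

    `lines` is any iterable of text lines, such as an open file.
    Hypothetical helper using reservoir sampling (Algorithm R).
    """
    reader = csv.reader(lines)
    header = next(reader)
    rng = random.Random(seed)  # fixed seed so local runs are repeatable
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return header, reservoir


# Toy stand-in for a large file; these are made-up values, not census data.
toy = io.StringIO("id,value\n" + "".join("{},{}\n".format(i, i * 10) for i in range(1000)))
header, rows = sample_csv_rows(toy, k=5)
print(header)
print(len(rows))
```

With a real file, pass an open file object instead of the `io.StringIO` stand-in, for example `open(path, newline="")`.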

The notebooks in this directory reflect this workflow.

  1. Local End to End - demonstrates the end-to-end development workflow, running locally in the Datalab environment.
  2. Service Preprocess - demonstrates data preparation and data analysis using data in Google Cloud Storage and BigQuery.
  3. Service Train - demonstrates training a model using Machine Learning Engine.
  4. Service Evaluate - demonstrates evaluating the resulting model using Dataflow and BigQuery.
  5. Service Predict - demonstrates deploying the model to Machine Learning Engine and using both online and batch prediction.